Improve bulking in Gluon #13890

ptrendx · 2019-01-15T20:54:56Z

Description

This PR improves performance of hybridized Gluon models on the GPU by better bulking of ops (running GPU ops without synchronization).

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage
Code is well-documented:
For new C++ functions in header files, their functionalities and arguments are documented.
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

For models hybridized with hybridize(), the engine currently creates an ImperativeBulk op that combines multiple operations into one. However, this ImperativeBulk op captures each op in its entirety, including the synchronization at its end. Since the engine dependencies are updated only at the very end of the ImperativeBulk op, these per-op synchronizations are wasteful. This PR changes the RunContext structure so that the ImperativeBulk op can inform the bulked ops that it will handle the synchronization itself. Imperative ops check the RunContext to determine whether they need to perform their own synchronization.
For models hybridized with hybridize(static_alloc=True, static_shape=True), there is currently another bulking mechanism, similar to the one used in the symbolic API. However, it excludes ops from bulking very aggressively: any op that produces output from the forward pass (which means most ops) cannot be bulked. This is an artificial limitation, and this PR lifts it.
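
To illustrate the first change, here is a minimal pure-Python sketch of the idea, not the actual engine code. The names `RunContext.is_bulk`, `FakeStream`, `run_op` and `run_bulk` are assumptions made up for this illustration; the real change lives in the C++ engine. The sketch only counts how many synchronizations happen when each bulked op synchronizes on its own versus when the bulk op synchronizes once at the end.

```python
from dataclasses import dataclass, field


@dataclass
class RunContext:
    # Hypothetical stand-in for the engine's run context. The flag tells an
    # op that an enclosing bulk op will take care of synchronization.
    is_bulk: bool = False


@dataclass
class FakeStream:
    # Toy replacement for a GPU stream: records launches and counts syncs.
    launched: list = field(default_factory=list)
    sync_count: int = 0

    def launch(self, name):
        self.launched.append(name)

    def synchronize(self):
        self.sync_count += 1


def run_op(name, ctx, stream):
    """Run one op; synchronize only if nobody else will do it."""
    stream.launch(name)
    if not ctx.is_bulk:
        stream.synchronize()


def run_bulk(op_names, stream, bulk_handles_sync):
    """Run several ops as one bulk op.

    With bulk_handles_sync=False every op synchronizes on its own (the old
    behaviour described above); with True the bulk op synchronizes once,
    after the last op has been launched.
    """
    ctx = RunContext(is_bulk=bulk_handles_sync)
    for name in op_names:
        run_op(name, ctx, stream)
    if bulk_handles_sync:
        stream.synchronize()


ops = ["conv", "batchnorm", "relu", "pool"]

before = FakeStream()
run_bulk(ops, before, bulk_handles_sync=False)

after = FakeStream()
run_bulk(ops, after, bulk_handles_sync=True)

print("syncs before:", before.sync_count)
print("syncs after: ", after.sync_count)
```

In this toy model the same four ops are launched in both cases; only the number of synchronization points changes, which is the saving the description refers to.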

Comments

With this PR, hybridize(static_alloc=True) and hybridize(static_alloc=True, static_shape=True) become pretty similar in performance in single-GPU tests. The two main differences seem to be:
- the time it takes to start the forward pass (I have not investigated this yet, but the non-static-shape case seems to take longer to start)
- the fact that ImperativeBulk is a temporary ThreadedEngineOpr destroyed at the end of its invocation, which seems to be a relatively expensive operation (a hybridized model with everything static mostly does not use ImperativeBulk, so it does not pay the cost of destroying the object).
I will post some performance comparisons in a future comment.

@eric-haibin-lin @piiswrong

ptrendx · 2019-01-15T23:05:12Z

Perf comparison:
ResNet-50 v1d from GluonCV on synthetic data, using a slightly modified script to enable synthetic data, on a DGX1-V 32G.

Default parameters from the GluonCV script (128 per GPU, 8 GPUs, NAG optimizer):
- Without this PR: 5700 imgs/s
- With this PR: 5910 imgs/s
Small batch (32 per GPU), 1 GPU, NAG optimizer changed to SGD (at such a small batch size, the lack of a true mixed-precision version of the NAG optimizer skews the results a lot):
- Without this PR: 570 imgs/s
- Without this PR (static_shape=False): 550 imgs/s
- With this PR: 620 imgs/s
- With this PR (static_shape=False): 600 imgs/s
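
For readers who want the relative gains, the following short sketch computes them from the throughput numbers quoted above. The dictionary labels and the helper name `speedup_percent` are only illustrative; the input figures are the ones reported in this comment.

```python
# Throughput in imgs/s as reported above: (without this PR, with this PR).
results = {
    "8 GPUs, batch 128 per GPU, NAG": (5700, 5910),
    "1 GPU, batch 32, SGD": (570, 620),
    "1 GPU, batch 32, SGD, static_shape=False": (550, 600),
}


def speedup_percent(before, after):
    """Relative throughput gain of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


for label, (before, after) in results.items():
    print(f"{label}: {speedup_percent(before, after):+.1f}%")
```

The gain is larger in the small-batch, single-GPU runs than in the 8-GPU run at batch size 128.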

ptrendx · 2019-01-15T23:30:28Z

Also adding @KellenSunderland

stu1130 · 2019-01-16T00:10:03Z

Thanks for your contribution @ptrendx
@mxnet-label-bot add [pr-awaiting-review]

ptrendx · 2019-01-16T21:41:28Z

The issue with test_dropout is unrelated, details in #9816

eric-haibin-lin

Looks good to me. @piiswrong Could you double check this PR?

* Improve bulking in Gluon * Trigger CI

Improve bulking in Gluon

3882654

ptrendx requested a review from anirudh2290 as a code owner January 15, 2019 20:54

Merge branch 'upstream' into pr_gluon_perf

b003188

marcoabreu added the pr-awaiting-review PR is waiting for code review label Jan 16, 2019

Trigger CI

ba821eb

eric-haibin-lin approved these changes Jan 24, 2019


eric-haibin-lin merged commit 89c7d57 into apache:master Jan 30, 2019

stephenrawls pushed a commit to stephenrawls/incubator-mxnet that referenced this pull request Feb 16, 2019

Improve bulking in Gluon (apache#13890)

3cb98a7

* Improve bulking in Gluon * Trigger CI

ptrendx mentioned this pull request Feb 19, 2019

Bulked op segments to allow Variable nodes #14200

Merged

7 tasks

vdantu pushed a commit to vdantu/incubator-mxnet that referenced this pull request Mar 31, 2019

Improve bulking in Gluon (apache#13890)

aa7179b

* Improve bulking in Gluon * Trigger CI

ptrendx mentioned this pull request Jun 18, 2019

Proper bulking of ops not using FCompute #15272

Merged

3 tasks

haohuanw pushed a commit to haohuanw/incubator-mxnet that referenced this pull request Jun 23, 2019

Improve bulking in Gluon (apache#13890)

ce524a2

* Improve bulking in Gluon * Trigger CI
